Abstract —To increase efficacy in traditional classroom courses 
as well as in Massive Open Online Courses (MOOCs), automated 
systems supporting the instructor are needed. One important 
problem is to automatically detect students that are going to 
do poorly in a course early enough to be able to take remedial 
actions. Existing grade prediction systems focus on maximizing 
the accuracy of the prediction while overlooking the importance of 
issuing timely and personalized predictions. This paper proposes 
an algorithm that predicts the final grade of each student in a 
class. It issues a prediction for each student individually, when the 
expected accuracy of the prediction is sufficient. The algorithm 
learns online what is the optimal prediction and time to issue 
a prediction based on past history of students’ performance in 
a course. We derive a confidence estimate for the prediction 
accuracy and demonstrate the performance of our algorithm on a 
dataset obtained from the performance of approximately 700 
UCLA undergraduate students who have taken an introductory 
digital signal processing course over the past 7 years. We demonstrate 
that for 85% of the students we can predict with 76% accuracy 
whether they are going to do well or poorly in the class after the 
4th course week. Using data obtained from a pilot course, our 
methodology suggests that it is effective to perform early in-class 
assessments such as quizzes, which result in timely performance 
prediction for each student, thereby enabling timely interventions 
by the instructor (at the student or class level) when necessary. 

Index Terms —Forecasting algorithms, online learning, grade 
prediction, data mining, digital signal processing education. 


I. Introduction 

Education is in a transformation phase; knowledge 
is increasingly becoming freely accessible to everyone 
(through Massive Open Online Courses, Wikipedia, etc.) and 
is developed by a large number of contributors rather than 
by a single author [1]. Furthermore, new technology allows 
for personalized education enabling students to learn more 
efficiently and giving teachers the tools to support each student 
individually if needed, even if the class is large [2]. 

Grades are supposed to summarize in a single number or 
letter how well a student was able to understand and apply 
the knowledge conveyed in a course. Thus it is crucial for 
students to obtain the necessary support to pass and do well 
in a class. However, with large class sizes at universities 
and even larger class sizes in Massive Open Online Courses 
(MOOCs), which have undergone a rapid development in the 
past few years, it has become impossible for the instructor 
and teaching assistants to keep track of the performance of 
each student individually. This can lead to students failing 
in a class who could have passed if appropriate remedial 
actions had been taken early enough or excellent students not 
receiving the necessary promotion to benefit maximally from 
the course. Remedial or promotional actions could consist 
of additional online study material presented to the student 
in a personalized and/or automated manner [3]. Hence, in 
both offline and online education, it is of great importance 
to develop automated personalized systems that predict the 
performance of a student in a course before the course is over 
and as soon as possible. While in online teaching systems a 
variety of data about a student such as responses to quizzes, 
activity in the forum and study time can be collected, the 
available data in a practical offline setting are limited to 
scores in early performance assessments such as homework 
assignments, quizzes and midterm exams. 

In this paper we focus on predicting grades in traditional 
classroom-teaching where only the scores of students from 
past performance assessments are available. However, we 
believe that our methods can also be applied for online courses 
such as MOOCs. We design a grade prediction algorithm that 
finds for each student the best time to predict his/her grade 
such that, based on this prediction, a timely intervention can 
be made if necessary. Note that we analyze data from a digital 
signal processing course where no interventions were made; 
hence, we do not study the impact of interventions and consider 
only a single grade prediction for each student. However, our 
algorithm can be easily extended to multiple predictions per 
student. 

A timely prediction exclusively based on the limited data 
from the course itself is challenging for various reasons. First, 
since at the beginning most students are motivated, the score 
of students in early performance assessments (e.g. homework 
assignments) might have little correlation with their score in 
later performance assessments, in-class exams and the overall 
score. Second, even if the same material is covered in each 
year of the course, the assignments and exams change every 
year. Therefore, the informativeness of particular assignments 
with regard to predicting the final grade may change over the 
years. Third, the predictability of students having a variety 
of different backgrounds is very diverse. For some students 
an accurate prediction can be made very early based on the 
first few performance assessments. If for example a student 
shows an excellent performance in the first three homework 
assignments and in the midterm exam, it is highly likely 
that he/she will pass the class. For other students it might 
take more time to make an equally accurate prediction. If a 
student for example performs below average but not terribly 
at the beginning, it is risky to predict whether he/she is going 
to pass or fail and, therefore, to decide whether or not to 
intervene. This third challenge illustrates the necessity to make 
the prediction for each student individually and not for all at 
the same time. 

The main contributions of this paper can be summarized as 
follows. 


1) We propose an algorithm that makes a personalized and 
timely prediction of the grade of each student in a class. 
The algorithm can both be used in regression settings, 
where the overall score is predicted, and in classification 
settings, where the students are classified into two (e.g. 
do well/poorly) or more categories. 

2) We accompany each prediction with a confidence estimate 
indicating the expected accuracy of the prediction. 

3) We derive a bound for the probability that the prediction 
error is larger than a desired value $\epsilon$. 

4) We exclusively use the scores students achieve in early 
performance assessments such as homework assignments 
and midterm exams and do not use any other 
information such as age, gender or previous GPA. This 
makes our algorithm applicable in all practical traditional 
classroom and online teaching settings, where 
such information may not be available. 

5) Since the algorithm is learning from past years, the 
predictions become more accurate when more data from 
previous years become available. 

6) We demonstrate that the algorithm shows good robustness 
if different instructors have taught the course in past 
years. 

7) We analyze real data from an introductory digital signal 
processing course taught at UCLA over 7 years and use 
the data to experimentally demonstrate the performance 
of our algorithm compared to benchmark prediction 
methods. As benchmark algorithms we use well known 
algorithms such as linear/logistic regression and k-Nearest 
Neighbors, which are still a current research 
topic 14|-0. 

8) Based on our simulations, we suggest a preferred way 
of designing courses that enables early prediction and 
early intervention. Using data from a pilot course, we 
demonstrate the advantages of the suggested design. 


The rest of the paper is organized as follows. Section II 
discusses related work in the field of grade and GPA 
prediction in education. In Section III we introduce notation, 
define data structures, formalize the problem and present the 
grade prediction algorithm. In Section IV we analyze the data, 
describe benchmark methods and present simulation results for 
our algorithm and the benchmark algorithms. Finally, we draw 
conclusions in Section V. 


II. Related Work 

Various studies have investigated the value of standardized 
tests 0- Q admissions exams ng and GPA in previous 
programs ® in predicting the academic success of students in 
undergraduate or graduate schools. They agree on a positive 
correlation between these predictors and success measures 
such as GPA or degree completion. Besides standardized 
tests, the relevance of other variables for predicting a 
student’s GPA has been investigated, usually resulting in the 
conclusion that GPA from prior education and past grades 
in certain subjects (e.g. math, chemistry) pT), PI have a 
strongly positive correlation as well. Reference [11] observes 
that simple linear and more complex nonlinear (e.g. artificial 
neural network) models frequently lead to similar prediction 
accuracies and concludes that there is either no complex 
nonlinear pattern to be found in the underlying data or the pattern 
cannot be recognized by their approach. Our simulations 
support the statement that simple linear models show a similar 
accuracy in grade predictions as more complex methods. 

Reference ID argues that the accuracy of GPA predictions 
frequently is mediocre due to different grading standards used 
in different classes and shows a higher validity for grade 
predictions in single classes. Consequently, many works focus 
on identifying relationships between a student’s grade in a 
particular class and variables related to the student |T4)-@. 
Relevant factors were found to include the student’s prior 
GPA |T4| , p3) , p9|-pT|, p3] , performance in related courses 
1 20 1, 21 , |23| , previous semester marks E). performance in 
entrance exams Gg, performance in early assignments of the 
class pT) , pT) , class attendance | [Tg , self-efficacy p^ and 
whether the student is repeating the class m- 

A limitation of the algorithms in the previously discussed 
papers is that they are difficult to apply in many education 
scenarios. Frequently, variables related to the student such 
as performance in related classes, GPA or self-efficacy are 
not available to the instructor because the data has not been 
collected or is not accessible due to privacy reasons. However, 
the instructor always has access to data he/she collects from 
his/her own course, such as the performance of each student in 
early homework assignments or midterm exams. This paper, 
therefore, focuses on predicting the final grade based on 
this easily accessible data, which is collected anyway by the 
instructor. 

Other works [24]-[30], which also exclusively use data 
from the course itself, differ significantly from this paper 
in several aspects. First, they rely on logged data in online 
education or Massive Open Online Course (MOOC) systems 
such as information about video-watching behavior, time spent 
on specific questions or forum activity. In contrast, our results 
are applicable to both online and offline courses, which include 
some kind of graded assignments or related feedback from the 
students during the course. Second, in order for the instructor 
to be able to take corrective actions it is of great importance to 
predict with a certain confidence the performance of students 
as early as possible. While our algorithm takes this into 
account by deciding for each student individually the best time 
to make the prediction using a confidence measure, related 
works do not provide a metric indicating the optimal time to 
predict. Third, while related works need training data from 
the course whose grades they want to predict, we show that 
we can use training data from past year classes of the same 
course. Finally, in contrast to algorithms from related work, 
which are only shown to be applicable to classification settings 
(e.g. pass/fail or letter grade), our algorithm can be used both 
in regression and classification settings. 

To make the predictions, related works use various data 
mining models such as regression models [14], p6), decision 
trees Gg-n), p5|, pg, p0|, support vector machines 
[14], [23]-[25], neural networks fTS), [27], [29], Bayesian 
classifiers [15], [25], clustering [26] and nearest neighbor 
techniques [23], [24], p9), pO). 

Table I summarizes the comparison between our paper and 
related work investigating and predicting student performance 
in a course. 

TABLE I 

Comparison With Related Work 

                              | &(22)                  | jpl-p), [23]         | [14], [19], [21]     | [24]                       | [25]-[30]            | Our Work
Goal of Paper                 | Find Relevant Features | Predict Course Grade | Predict Course Grade | Predict Accuracy of Answer | Predict Course Grade | Predict Course Grade
Features                      | Other                  | Course & Other       | Course & Other       | From Course                | From Course          | From Course
Learning from Past Years      | n/a                    | No                   | No                   | No                         | No                   | Yes
Accuracy-Timeliness Trade-Off | n/a                    | No                   | No                   | No                         | No                   | Yes
Regression / Classification   | n/a                    | Classification       | Both                 | Classification             | Classification       | Both


III. Formalism, Algorithm and Analysis 

In this section we mathematically formalize the problem 
and propose an algorithm that predicts the final score or a 
classification according to the final grade of a student with a 
given confidence. 


A. Definitions and System Description 

Consider a course which is taught for several years with 
only slight modifications. Students attending the course have to 
complete performance assessments such as graded homework 
assignments, course projects and in-class exams and quizzes 
throughout the entire course.¹ Our goal is to predict with a 
certain confidence the overall performance of a student before 
all performance assessments have been taken. See Fig. 1 for 
a depiction of the system. 

We consider a discrete time model with $y = 1, 2, \ldots, Y$ 
and $k = 1, 2, \ldots, K$ where $y$ denotes the year in which the 
course is taught and $k$ the point in time in year $y$ after the 
$k$th performance assessment has been graded. $Y$ gives the total 
number of years during which the course is taught and $K$ is the 
total number of performance assessments of each year. For a 
given year $y$ we use index $i$ as a representation of the $i$th student 
of the year and $I_y$ to denote the total number of students 
attending in year $y$. Except for the rare case that a student 
retakes the course, the students in each year are different. Let 
$a_{i,y,k} \in [0,1]$ denote the normalized score or grade of student 
$i$ in performance assessment $k$ of year $y$. 

The feature vector of $y$th year student $i$ after having 
taken performance assessment $k$ is given by 
$\mathbf{x}_{i,y,k} = (a_{i,y,1}, \ldots, a_{i,y,k})$. The normalized overall score $z_{i,y} \in [0,1]$ 
of $y$th year student $i$ is the weighted sum of all performance 
assessments 

$$z_{i,y} = \sum_{k=1}^{K} w_k a_{i,y,k} \tag{1}$$

where $w_k$ denotes the weight of performance assessment $k$, 
so that $\sum_{k=1}^{K} w_k = 1$. The weights are set by the instructor 
and we assume that in each year the number, sequence and 
weight of performance assessments is the same. This assumption 
is reasonable since the content of a course usually does 
not change drastically over the years and frequently the same 
course material (e.g. course book) is used.² This is especially 
true in an introductory course such as the one we investigate 
in Section IV. The residual (overall score) $c_{i,y,k}$ of $y$th year 
student $i$ after performance assessment $k$ is defined as 

$$c_{i,y,k} = \begin{cases} \sum_{l=k+1}^{K} w_l a_{i,y,l} & k \in \{1, \ldots, K-1\} \\ 0 & k = K. \end{cases} \tag{2}$$

Using this definition we can write the overall score of $y$th year 
student $i$ as 

$$z_{i,y} = \sum_{l=1}^{k} w_l a_{i,y,l} + c_{i,y,k}. \tag{3}$$

Note that after having taken performance assessment $k$, 
the instructor has access to all the scores up to assignment $k$ 
but the residual score $c_{i,y,k}$ needs to be estimated. We denote 
the estimate of the residual score for $y$th year student $i$ at 
time $k$ by $\hat{c}_{i,y,k}$ and the corresponding estimate of the overall 
score by $\hat{z}_{i,y,k}$. In binary classification settings, where the goal 
is to predict whether a student achieves a letter grade above 
or below a certain threshold, we denote the class of $y$th year 
student $i$ by $b_{i,y} \in \{0,1\}$. 

[Figure 1: diagram containing the block "Take Corrective Actions in Consequence of the Predicted Grade"] 

Fig. 1. System diagram for a single student. 

¹The performance assessments are usually graded by teaching assistants, 
by the instructor or even by other students through peer review (g. 

²This assumption is made for simplicity. As we discuss in Section IV-B 
and show in Fig. we can apply our algorithm to settings where different 
instructors using a different number and sequence of performance assessments 
and using different weights for each performance assessment teach the course. 

For each student $i$ we store the set of feature vectors 
$X_{i,y} = \{\mathbf{x}_{i,y,k} \mid k \in \{1, \ldots, K\}\}$, the set of residuals 
$C_{i,y} = \{c_{i,y,1}, \ldots, c_{i,y,K-1}\}$ and the student’s overall score $z_{i,y}$. All 
feature vectors from all students of year $y$ are given by 
$X_y = \bigcup_{i=1}^{I_y} X_{i,y}$ and $X = \bigcup_{y=1}^{Y} X_y$ denotes all feature vectors 
of all completed years. Similarly $C = \bigcup_{y=1}^{Y} \bigcup_{i=1}^{I_y} C_{i,y}$ and 
$Z = \bigcup_{y=1}^{Y} \bigcup_{i=1}^{I_y} z_{i,y}$ denote all residuals and overall scores of 
all completed years. Let $X^{k'} = \{\mathbf{x}_{i,y,k} \mid k = k', \forall i, y\}$ denote 
the set of feature vectors and $C^{k'} = \{c_{i,y,k} \mid k = k', \forall i, y\}$ 
denote the set of residuals saved after performance assessment 
$k'$. 
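
The decomposition of the overall score into a known part and a residual, as in (1)-(3), can be illustrated with a minimal Python sketch. The function names and the example weights and scores are hypothetical and chosen only for illustration; they are not taken from the course data.

```python
from typing import Sequence


def overall_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of all K normalized assessment scores, as in (1)."""
    return sum(w * a for w, a in zip(weights, scores))


def known_part(scores: Sequence[float], weights: Sequence[float], k: int) -> float:
    """Weighted sum of the first k assessments (already graded at time k)."""
    return sum(w * a for w, a in zip(weights[:k], scores[:k]))


def residual(scores: Sequence[float], weights: Sequence[float], k: int) -> float:
    """Weighted sum of assessments k+1..K, as in (2); equals 0 when k == K."""
    return sum(w * a for w, a in zip(weights[k:], scores[k:]))


# Hypothetical course with K = 3 assessments whose weights sum to 1.
weights = [0.2, 0.3, 0.5]
scores = [1.0, 0.5, 0.8]

z = overall_score(scores, weights)
# Identity (3): known part plus residual gives the overall score for every k.
for k in range(1, len(weights) + 1):
    assert abs(known_part(scores, weights, k) + residual(scores, weights, k) - z) < 1e-12
```

At prediction time only `known_part` can be computed for a current student; the residual is the quantity that has to be estimated from past years.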


B. Problem Formulation 


Having introduced notations, definitions and data structures, 
we now formalize the grade prediction problem. We will 
investigate two different types of predictions. The objective 
of the first type, which we refer to as regression setting, 
is to accurately predict the overall score of each student 
individually in a timely manner. The second problem, referred 
to as classification setting, aims at making a binary prediction 
of whether the student will do well or poorly, or whether he/she 
will need additional help. Again, the prediction 
is personalized and takes timeliness into account. For both 
types of predictions, the same algorithm can be used with only 
slight modifications, which we discuss in Section III-D. We 
will also show that the binary prediction problem can easily 
be generalized to a classification into three or more classes. 

Irrespective of the type of the prediction, the decision for a 
$y$th year student $i$ consists of two parts. First, we decide after 
which performance assessment $k^*_{i,y}$ to predict for the given 
student and second we determine his/her estimated overall 
score $\hat{z}_{i,y}$ or his/her estimated binary classification $\hat{b}_{i,y}$. At a 
point in time $k$ of year $y$ all scores including the overall scores 
of all students of past years $1, \ldots, y-1$ are known. Thus all 
feature vectors $\mathbf{x} \in X$, residuals $c \in C$ and overall scores 
$z \in Z$ of all completed years are known. Furthermore, the 
scores $a_{i,y,1}, \ldots, a_{i,y,k}$ of $y$th year student $i$ up to assessment 
$k$ are known as well and do not have to be estimated. However, 
to determine the overall score of the student we need to 
predict his/her residual score $c_{i,y,k}$ consisting of performance 
assessments $k+1, \ldots, K$ since they lie in the future and are 
unknown. At time $k$ we have to decide for each student of 
the current year whether this is the optimal time $k^*_{i,y} = k$ to 
predict or whether it is better to wait for the next performance 
assessment. If we decide to predict, we determine the optimal 
prediction of the overall score $\hat{z}_{i,y} = \hat{z}_{i,y,k^*_{i,y}}$. Both decisions 
are made based on the feature vector $\mathbf{x}_{i,y,k}$ of the given student 
and the feature vectors $\mathbf{x} \in X^k$ and residuals $c \in C^k$ of 
past students. To determine the optimal time to predict, we 
calculate a confidence $q_{i,y}(k)$ indicating the expected accuracy 
of the prediction for each student after each performance 
assessment. The prediction for a particular student is made 
as soon as the confidence exceeds a user-defined threshold 
$q_{i,y}(k) > q_{th}$. The problem of finding the optimal prediction 
time for $y$th year student $i$ is formalized as follows: 


$$\begin{aligned} &\underset{k}{\text{minimize}} && k \\ &\text{subject to} && q_{i,y}(k) > q_{th} \end{aligned} \tag{4}$$

The optimization problem results in the optimal prediction 
time $k^*_{i,y}$. 
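
Because the confidence values become available one assessment at a time, problem (4) amounts to stopping at the first assessment whose confidence exceeds the threshold. A minimal Python sketch follows; the function name and the example confidence values are hypothetical, and the fallback to the last assessment $K$ is an assumption that mirrors the case $k = K$ in which all scores are known.

```python
from typing import Sequence


def optimal_prediction_time(confidences: Sequence[float], q_th: float) -> int:
    """Return the smallest k (1-indexed) with q(k) > q_th.

    confidences[k - 1] holds the confidence q_{i,y}(k) for one student.
    If no confidence exceeds the threshold, fall back to K, the last
    assessment (assumption: at k = K the overall score is known exactly).
    """
    for k, q in enumerate(confidences, start=1):
        if q > q_th:
            return k
    return len(confidences)


# Hypothetical confidences after each of K = 4 assessments.
k_star = optimal_prediction_time([0.2, 0.6, 0.9, 0.95], q_th=0.8)
```

A higher threshold `q_th` delays the prediction for students whose neighborhoods are still heterogeneous, which is the accuracy-timeliness trade-off described above.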


C. Grade Prediction Algorithm, Regression Setting 


In this section we propose an algorithm that learns to predict 
a student’s overall performance based on data from classes 
held in past years and based on the student’s results in already 
graded performance assessments. We describe the algorithm 
for the regression setting and explain the changes needed to 
use the algorithm in the classification setting in Section III-D. 

Since at time $k$ we know the scores $a_{i,y,1}, \ldots, a_{i,y,k}$ of 
the considered student from past performance assessments as 
well as the corresponding weights $w_1, \ldots, w_k$, we only predict 
the residual $c_{i,y,k}$ and calculate the prediction of the overall 
score with (3). To make its prediction for the current residual 
of a student with feature vector $\mathbf{x}_{i,y,k}$, the algorithm finds 
all feature vectors from similar students of past years and 
their corresponding residuals $c_{i,y,k}$. We define the similarity 
of students through their feature vectors. Two feature vectors 
$\mathbf{x}_i, \mathbf{x}_j \in X^k$ are similar if $\langle \mathbf{x}_i, \mathbf{x}_j \rangle_k < r$ where $\langle \cdot, \cdot \rangle_k$ is a 
distance metric defined on the feature space $X^k$ and $r$ is a 
parameter. For two feature vectors $\mathbf{x} \in X^{k_1}$ and $\mathbf{x}' \in X^{k_2}$ 
from different feature spaces (i.e. $k_1 \neq k_2$) the distance metric 
is not defined since we only need to determine distances 
within a single feature space. Different feature spaces can 
have different definitions of the distance metric; we are going 
to define the distance metrics we use in Section IV-B. We 
define a neighborhood $B(\mathbf{x}_c, r)$ with radius $r$ of feature vector 
$\mathbf{x}_c \in X^k$ as all feature vectors $\mathbf{x} \in X^k$ with $\langle \mathbf{x}_c, \mathbf{x} \rangle_k < r$. 
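
The neighborhood construction can be sketched in a few lines of Python. The Euclidean distance used here is only an assumption for illustration, since the distance metrics actually used are defined later in the paper; the function names and sample vectors are likewise hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Assumed distance metric on a single feature space X^k."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def neighborhood(center: Vector, past_vectors: Sequence[Vector], r: float,
                 dist: Callable[[Vector, Vector], float] = euclidean) -> List[int]:
    """Indices of past students whose feature vector lies within distance r.

    All vectors must come from the same feature space, i.e. have the same
    length k; distances across different k are not defined.
    """
    return [idx for idx, x in enumerate(past_vectors) if dist(center, x) < r]


# Hypothetical feature vectors after k = 2 assessments.
past = [(0.5, 0.6), (0.9, 0.9), (0.4, 0.5)]
neighbors = neighborhood((0.5, 0.5), past, r=0.2)
```

Returning indices rather than vectors makes it easy to look up the residuals of the same past students in a parallel list.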

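Let $C^k$ denote the random variable representing the residual 
score after performance assessment $k$. $f^k(C^k \mid \mathbf{x})$ denotes the 
probability distribution over the residual score for a student 
with feature vector $\mathbf{x}$ at time $k$ and $\mu^k(\mathbf{x})$ denotes the student’s 
expected residual score. Let $p^k(\mathbf{x})$ denote the probability distribution 
of the students over the feature space $X^k$. Intuitively 
$p^k(\mathbf{x})$ is the fraction of students with feature vector $\mathbf{x}$ at time 
$k$. Note that the distributions $f^k(C^k \mid \mathbf{x})$ and $p^k(\mathbf{x})$ are not 
sampling distributions but unknown underlying distributions. 
We assume that the distributions do not change over the years. 

We define the probability distribution of the students in a 
neighborhood $B(\mathbf{x}_c, r)$ with center $\mathbf{x}_c$ and radius $r$ as 

$$p^k_{\mathbf{x}_c,r}(\mathbf{x}) := \frac{p^k(\mathbf{x})}{\int_{\mathbf{x}' \in B(\mathbf{x}_c,r)} dp^k(\mathbf{x}')} \, \mathbb{1}_{B(\mathbf{x}_c,r)}(\mathbf{x}),$$

where $\mathbb{1}$ is the indicator function. Intuitively $p^k_{\mathbf{x}_c,r}(\mathbf{x})$ is the 
fraction of students in neighborhood $B(\mathbf{x}_c, r)$ with feature vector 
$\mathbf{x}$. Let $C^k(B(\mathbf{x}_c,r))$ be the random variable representing 
the residual score of students in neighborhood $B(\mathbf{x}_c,r)$ after 
having taken performance assessment $k$. The distribution of 
$C^k(B(\mathbf{x}_c,r))$ is given by 

$$f\left(C^k(B(\mathbf{x}_c,r))\right) = \int_{\mathbf{x} \in X^k} f^k(C^k \mid \mathbf{x}) \, dp^k_{\mathbf{x}_c,r}(\mathbf{x}).$$

We denote the true expected value of the residual scores 
after assignment $k$ of students in a particular neighborhood 
by $\mu^k(\mathbf{x}_c,r) := E\left[C^k(B(\mathbf{x}_c,r))\right]$. Note that 

$$\mu^k(\mathbf{x}_c, r) = E_{\mathbf{x} \sim p^k_{\mathbf{x}_c,r}}\left[E\left[C^k \mid \mathbf{x}\right]\right] = E_{\mathbf{x} \sim p^k_{\mathbf{x}_c,r}}\left[\mu^k(\mathbf{x})\right] = \int_{\mathbf{x} \in X^k} \mu^k(\mathbf{x}) \, dp^k_{\mathbf{x}_c,r}(\mathbf{x}).$$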

Our estimation of the true expected residual of students 
within a particular neighborhood $B(\mathbf{x}_{i,y,k}, r)$ is given by 

$$\hat{\mu}\left(C^k(B(\mathbf{x}_{i,y,k}, r))\right) = \frac{\sum_{\mathbf{x} \in B(\mathbf{x}_{i,y,k},r)} c_{\mathbf{x},k}}{|B(\mathbf{x}_{i,y,k},r)|} \tag{5}$$

where $c_{\mathbf{x},k}$ denotes the residual after time $k$ of the student 
with feature vector $\mathbf{x}$. For notational simplicity, we use 
$\hat{\mu}^k(\mathbf{x}_{i,y,k}, r) := \hat{\mu}\left(C^k(B(\mathbf{x}_{i,y,k}, r))\right)$ to denote the estimated 
expectation. In the following we are going to derive how 
confident we are in the estimation of the residual score 
based on a given neighborhood $B(\mathbf{x},r)$ and how we use this 
confidence $q(B(\mathbf{x}, r))$ to both select the optimal radius of the 
neighborhood and to decide when to predict. 

Intuitively, if the feature vectors after performance assessment 
$k$ in a neighborhood $B(\mathbf{x}, r)$ of $\mathbf{x}$ contain a lot of 
information about the residual $c_{\mathbf{x},k}$, past students with feature 
vectors in this neighborhood should have had similar residuals. 
Hence, the variance of the residuals $\mathrm{Var}\left(C^k(B(\mathbf{x}, r))\right)$ 
of the students in the neighborhood should be small. To 
mathematically support this intuition, we consider the residuals 
$c_{i,y,k}$ in a neighborhood $B(\mathbf{x},r)$ of feature vector $\mathbf{x}$ together 
with the distribution of $C^k(B(\mathbf{x},r))$. For any confidence interval $\epsilon$ the 
probability that the absolute difference between the unknown 
residual $c_{\mathbf{x},k}$ of the student with feature vector $\mathbf{x}$ and the 
expected value of the residual distribution $\mu^k(\mathbf{x}, r)$ in his/her 
neighborhood is smaller than $\epsilon$ can be bounded by 

$$P\left[\left|C^k(B(\mathbf{x},r)) - \mu^k(\mathbf{x},r)\right| < \epsilon\right] \ge 1 - \frac{\mathrm{Var}\left(C^k(B(\mathbf{x},r))\right)}{\epsilon^2}. \tag{6}$$

This statement directly follows from Chebyshev’s inequality. 
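
As a worked instance of (6), assume purely hypothetical values $\mathrm{Var}\left(C^k(B(\mathbf{x},r))\right) = 0.0025$ (a standard deviation of $0.05$) and $\epsilon = 0.1$; these numbers are illustrative and not taken from the course data. The bound then gives 

$$P\left[\left|C^k(B(\mathbf{x},r)) - \mu^k(\mathbf{x},r)\right| < 0.1\right] \ge 1 - \frac{0.0025}{0.1^2} = 0.75.$$

If the standard deviation of the residuals in the neighborhood were halved to $0.025$, so that the variance is $0.000625$, the same bound would rise to $1 - 0.000625/0.01 = 0.9375$. Note that the bound becomes vacuous (non-positive) as soon as the variance reaches $\epsilon^2$. 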

We conclude that the lower the variance of the residual distribution 
in the neighborhood, the more confident we are that 
the true residual $c_{\mathbf{x},k}$ will be close to $\mu^k(\mathbf{x}, r)$. Since both the 
expected value $\mu^k(\mathbf{x}, r)$ and the variance $\mathrm{Var}\left(C^k(B(\mathbf{x},r))\right)$ 
of the distribution are unknown, we estimate the two values 
through the sample mean from (5) and the sample variance 
$\widehat{\mathrm{Var}}\left(C^k(B(\mathbf{x},r))\right)$ given by 

$$\widehat{\mathrm{Var}}\left(C^k(B(\mathbf{x},r))\right) = \frac{\sum_{\mathbf{x}' \in B(\mathbf{x},r)} \left(c_{\mathbf{x}',k} - \hat{\mu}^k(\mathbf{x},r)\right)^2}{|B(\mathbf{x}, r)| - 1}. \tag{7}$$

In the following we use $\mathrm{Var}^k(\mathbf{x},r) := \mathrm{Var}\left(C^k(B(\mathbf{x}, r))\right)$ to 
denote the variance and $\widehat{\mathrm{Var}}^k(\mathbf{x},r) := \widehat{\mathrm{Var}}\left(C^k(B(\mathbf{x},r))\right)$ 
to denote the sample variance of the residual distribution 
in neighborhood $B(\mathbf{x},r)$. From the law of large numbers 
it follows that the sample mean and the sample variance 
converge to the true expected value and the true variance for 
$|B(\mathbf{x},r)| \to \infty$. We will provide a bound for the probability 
that the prediction error is larger than a given value in the 
theorem below. Given a desired confidence interval $\epsilon$, we 
define the confidence on the prediction of the residual as 

$$q(B(\mathbf{x},r)) = 1 - \frac{\widehat{\mathrm{Var}}^k(\mathbf{x},r)}{\epsilon^2}. \tag{8}$$

Using this confidence measure the radius of the optimal 
neighborhood after performance assessment $k$ is given by 
$r^* = \arg\max_r q(B(\mathbf{x}_{i,y,k}, r)) = \arg\min_r \widehat{\mathrm{Var}}^k(\mathbf{x}_{i,y,k}, r)$. To estimate 
$r^*$ after each performance assessment $k$, our algorithm 
considers $M$ different neighborhoods $B(\mathbf{x}_{i,y,k}, r_m)$, $m = 1, \ldots, M$, 
with user-defined radii $r_1, \ldots, r_M$ and chooses the best 
neighborhood $\hat{m}_k(\mathbf{x}_{i,y,k})$ according to our confidence measure, 
$\hat{m}_k(\mathbf{x}_{i,y,k}) = \arg\max_m q(B(\mathbf{x}_{i,y,k}, r_m))$. In the following 
we use $\hat{m}_k := \hat{m}_k(\mathbf{x}_{i,y,k})$ to denote the best neighborhood. 
Let 

$$\hat{c}_{i,y,k} = \hat{\mu}^k(\mathbf{x}_{i,y,k}, r_{\hat{m}_k}) \tag{9}$$

denote the estimated residual of the best neighborhood at time 
$k$ and let $\hat{z}_{i,y,k}$ denote the corresponding estimated overall score 

$$\hat{z}_{i,y,k} = \hat{c}_{i,y,k} + \sum_{l=1}^{k} w_l a_{i,y,l}. \tag{10}$$

If the confidence bound for the best neighborhood $q_{i,y}(k) = 
q(B(\mathbf{x}_{i,y,k}, r_{\hat{m}_k}))$ is above a given threshold, $q_{i,y}(k) > q_{th}$, 
the algorithm returns the final prediction of the overall score 
$\hat{z}_{i,y} = \hat{z}_{i,y,k}$ for the considered student. 

If the confidence is below the threshold, we wait for the 
next performance assessment and start the next iteration. Fig. 
2 illustrates the neighborhood selection process. Algorithm 1 
provides a formal description of the grade prediction algorithm 
in pseudocode. 
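
One iteration of this procedure for a single student at a single time $k$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the example numbers are hypothetical, the neighborhoods are passed in as ready-made lists of past residuals (one list per radius $r_m$), and neighborhoods with fewer than two members are skipped because the sample variance in (7) is undefined for them.

```python
from statistics import mean, variance
from typing import List, Optional, Sequence, Tuple


def confidence(residuals: Sequence[float], eps: float) -> float:
    """Confidence (8): one minus the sample variance (7) divided by eps^2."""
    return 1.0 - variance(residuals) / eps ** 2


def predict_overall_score(known_part: float,
                          neighborhoods: Sequence[Sequence[float]],
                          eps: float,
                          q_th: float) -> Tuple[Optional[float], float]:
    """Pick the most confident neighborhood and predict if q exceeds q_th.

    known_part is the weighted sum of the scores graded so far.
    Returns (z_hat, q); z_hat is None when the algorithm should wait.
    """
    usable: List[Sequence[float]] = [res for res in neighborhoods if len(res) >= 2]
    if not usable:
        return None, float("-inf")
    best = max(usable, key=lambda res: confidence(res, eps))
    q = confidence(best, eps)
    z_hat = known_part + mean(best)  # estimated residual (9) plus known scores (10)
    return (z_hat if q > q_th else None), q


# Hypothetical residuals of past students for two radii r_1 < r_2.
tight = [0.40, 0.42, 0.38]
wide = [0.40, 0.42, 0.38, 0.10, 0.70]
z_hat, q = predict_overall_score(0.3, [tight, wide], eps=0.1, q_th=0.9)
```

In this example the tighter neighborhood has the smaller sample variance and is selected, even though the wider one contains more past students.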

To conclude the discussion of the grade prediction algorithm 
in the regression setting, we derive a bound for the probability 
that the prediction error is larger than a value $\epsilon$. Before we 
state the theorem, we introduce some further notations. Let 
$m^*_k(\mathbf{x})$ denote the index of the neighborhood with the smallest 
variance of residuals for the student with feature vector $\mathbf{x}$ at 
time $k$ 

$$m^*_k(\mathbf{x}) = \arg\min_{1 \le m \le M} \mathrm{Var}^k(\mathbf{x}, r_m). \tag{11}$$

Note that $m^*_k(\mathbf{x})$ is not necessarily equal to $\hat{m}_k(\mathbf{x})$, the index 
of the neighborhood with the highest confidence chosen by 
our algorithm, since the confidence defined in (8) is calculated 
with the known sample variance of residuals $\widehat{\mathrm{Var}}^k(\mathbf{x},r)$ and 
not with the unknown true variance $\mathrm{Var}^k(\mathbf{x}, r)$ used in (11). 

Similarly $m^*_{k,2}(\mathbf{x})$ denotes the index of the neighborhood 
with the second highest confidence. 

$$m^*_{k,2}(\mathbf{x}) = \arg\min_{m \neq m^*_k(\mathbf{x})} \mathrm{Var}^k(\mathbf{x}, r_m).$$

Let $\Delta_k(\mathbf{x})$ denote the difference between the standard deviations 
of the residual distribution of neighborhoods $m^*_k(\mathbf{x})$ and 
$m^*_{k,2}(\mathbf{x})$ 

$$\Delta_k(\mathbf{x}) = \sqrt{\mathrm{Var}^k\left(\mathbf{x}, r_{m^*_{k,2}(\mathbf{x})}\right)} - \sqrt{\mathrm{Var}^k\left(\mathbf{x}, r_{m^*_k(\mathbf{x})}\right)}. \tag{12}$$
Fig. 2. Illustration of the neighborhood selection process. 


Algorithm 1 Grade Prediction Algorithm, Regression Setting 
Input: All $\mathbf{x}$ and $z$ from past years, $q_{th}$, number $M$ and radii 
$r_1, \ldots, r_M$ of neighborhoods 
Output: Predictions $\hat{z}$ for the overall scores of the students 
 1: for all years $y$ do 
 2:   for all performance assessments $k$ do 
 3:     for all current-year students $i$ for whom the final prediction has not been made do 
 4:       if $k = K$ then 
 5:         Calculate $z_{i,y}$ according to (1) 
 6:         Return $z_{i,y}$ as final prediction for student $i$ 
 7:       end if 
 8:       Create $M$ neighborhoods with radii $r_1, \ldots, r_M$ 
 9:       for all neighborhoods $m$ do 
10:         Estimate residual $\hat{c}(B(\mathbf{x}_{i,y,k}, r_m))$ with (5) 
11:         Compute $\widehat{\mathrm{Var}}^k(\mathbf{x}_{i,y,k}, r_m)$ with (7) 
12:         Compute $q(B(\mathbf{x}_{i,y,k}, r_m))$ with (8) 
13:       end for 
14:       Find $\hat{m}_k = \arg\max_m q(B(\mathbf{x}_{i,y,k}, r_m))$ 
15:       if $q(B(\mathbf{x}_{i,y,k}, r_{\hat{m}_k})) > q_{th}$ then 
16:         Compute $\hat{z}_{i,y}$ with (10) 
17:         Return $\hat{z}_{i,y}$ as final prediction for student $i$ 
18:       end if 
19:       Add $\mathbf{x}_{i,y,k}$ and $a_{i,y,k}$ to database 
20:     end for 
21:   end for 
22:   Calculate all $c_{i,y,k}$ of year $y$ according to (2) 
23:   Add all $c_{i,y,k}$ to database 
24: end for 


Theorem. Without loss of generality we assume that all 
scores $a$ are normalized to the range $[0,1]$. Consider the 
prediction $\hat{z}_{i,y,k}$ of the overall score of $y$th year student $i$ 
with feature vector $\mathbf{x}$ made by Algorithm 1. The probability 
that the absolute error of the prediction exceeds $\epsilon$ is bounded by 


P WZ^.V - Z; 


'i,y,k\ > e] < 


AVar'^ (^x,r^.(x)) 


2 exp 
2M exp 


—e mm 
l<m<M 


l-B(x,r„)| 


—Afe(x)^ min 
l<m<M 


|B(x,rm)| - 1 


Proof: See Appendix. ■ 

This theorem illustrates two important aspects of Algorithm 1. 
First, we see that for a given neighborhood the accuracy 
of our predictions increases with an increasing number of 
neighbors. Hence, our algorithm learns the best predictions 
online as the knowledge base is expanded after each year, 
when the feature vectors and results from the past-year students 
are added to the database. In Section IV-D we show 
that this learning can be experimentally illustrated with our 
data from the digital signal processing course taught at UCLA. 
Second, the term $\mathrm{Var}^k(\mathbf{x}, r_{m^*_k(\mathbf{x})})/\epsilon^2$ shows that the prediction 
accuracy will be higher if the variance of the residuals in a 
neighborhood is small. With increasing time $k$ we expect this 
variance to decrease since we have more information about 
the students and we expect the students in a neighborhood to 
be more similar and achieve similar (residual) scores. 

Note that it is possible to restrict the data kept in the 
knowledge base to recent years, which allows the algorithm 
to adapt faster to slowly changing students and to changes in 
the course. 


D. Grade Prediction Algorithm, Classification Setting 

In the binary classification setting we predict the overall 
score analogously to the regression setting and then determine 
the class by comparing the predicted overall score $\hat{z}_{i,y}$ with 
a threshold score $z_{th}$. To illustrate how we find $z_{th}$ let us 
assume that we want to predict whether a student does well 
(letter grades $\ge$ B-) or does poorly (letter grades $\le$ C+). To 
determine $z_{th}$, we find the average overall score $z_{avg,B-}$ of all students from 
past years who received a B- and the average overall score $z_{avg,C+}$ of all 
students from past years who achieved a C+. Subsequently, 
we define $z_{th}$ as $z_{th} = (z_{avg,B-} + z_{avg,C+})/2$. The predicted 
classification $\hat{b}_{i,y}$ of $y$th year student $i$ is then given by 

$$\hat{b}_{i,y} = \begin{cases} 0 & \hat{z}_{i,y} \ge z_{th} \\ 1 & \hat{z}_{i,y} < z_{th}. \end{cases} \tag{13}$$
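
The threshold construction and the rule in (13) can be sketched in Python as follows. The function names and the sample scores are hypothetical; assigning a predicted score exactly equal to the threshold to class 0 is an assumption made here so that the two cases cover every score.

```python
from statistics import mean
from typing import Sequence


def threshold_score(scores_b_minus: Sequence[float],
                    scores_c_plus: Sequence[float]) -> float:
    """Midpoint between the average overall scores of past B- and C+ students."""
    return (mean(scores_b_minus) + mean(scores_c_plus)) / 2


def classify(z_hat: float, z_th: float) -> int:
    """Binary class as in (13): 0 = does well, 1 = does poorly.

    Assumption: a predicted score equal to z_th is treated as class 0.
    """
    return 0 if z_hat >= z_th else 1


# Hypothetical overall scores of past students at the two boundary grades.
z_th = threshold_score([0.70, 0.72], [0.64, 0.66])
label = classify(0.75, z_th)
```

Recomputing `z_th` at the start of each year lets the threshold follow shifts in how the instructor maps scores to letter grades.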


We are more confident in the classification not only if the 
variance of the neighbor scores is small, which is the metric 
we used for the confidence in the regression setting, but also 
if the distance $d(B(\mathbf{x},r)) = |\hat{z}(B(\mathbf{x},r)) - z_{th}|$ between the 
predicted score and the threshold score is large. Note that 
$\hat{z}(B(\mathbf{x},r))$ is the estimate of the overall score based on 
neighborhood $B(\mathbf{x},r)$. Because of this intuition we use a 
modified confidence 


g"*" (S(x,r)) = l-e 




(14) 
to decide whether to make the final prediction in binary
classification settings. Since $d(B(\mathbf{x},r))$ should only influence
whether the final prediction is made for a given neighborhood
but not the neighborhood selection process, we still use the
unmodified confidence from the regression setting to select the
optimal neighborhood.

In summary, four changes have to be made to Algorithm 1 to
make it applicable to binary classification settings. First, $z_{th}$
has to be determined/updated at the beginning of each new
year. Second, we calculate $\hat{b}_{i,y}$ according to (13) after line 16.
Third, we return $\hat{b}_{i,y}$ instead of $\hat{z}_{i,y}$ at line 17. Fourth, we
use the modified confidence according to (14) in line 15
instead of the unmodified confidence $q$. The neighborhood
selection still uses the unmodified confidence.

The described binary classification algorithm can easily be
generalized to a larger number of categories. In a classification
with $L$ categories, we define $L-1$ threshold values
$z_{th,1} < z_{th,2} < \dots < z_{th,L-1}$ and determine in which of
the $L$ score intervals $\{[0, z_{th,1}), [z_{th,1}, z_{th,2}), \dots, [z_{th,L-1}, 1]\}$
the predicted overall score $\hat{z}_{i,y}$ of a student lies. The index of
the interval corresponds to the classification of the student.
In this general classification setting, the modified confidence
from (14) can be used as well by defining $d$ as the distance
of $\hat{z}_{i,y}$ to the nearest threshold value.

We discuss the performance of the proposed algorithm in
both regression and classification settings in Section IV-D.
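The threshold construction and the interval lookup described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`binary_threshold`, `binary_label`, `interval_index`, `distance_to_nearest_threshold`) are hypothetical, and the code assumes that past overall scores and letter grades are available as parallel lists.

```python
from bisect import bisect_right
from statistics import mean


def binary_threshold(past_scores, past_grades, upper="B-", lower="C+"):
    """z_th: midpoint between the mean past overall scores of the two
    letter grades adjacent to the do well / do poorly boundary."""
    z_upper = mean(z for z, g in zip(past_scores, past_grades) if g == upper)
    z_lower = mean(z for z, g in zip(past_scores, past_grades) if g == lower)
    return (z_upper + z_lower) / 2


def binary_label(z_hat, z_th):
    """Rule (13): 0 (does well) if z_hat >= z_th, else 1 (does poorly)."""
    return 0 if z_hat >= z_th else 1


def interval_index(z_hat, thresholds):
    """General case with L categories: index of the half-open score
    interval containing z_hat. `thresholds` must be sorted ascending."""
    return bisect_right(thresholds, z_hat)


def distance_to_nearest_threshold(z_hat, thresholds):
    """Distance d used by the modified confidence in the general case."""
    return min(abs(z_hat - t) for t in thresholds)
```

Note that `interval_index` counts intervals from the lowest scores upward, whereas the binary rule (13) assigns label 1 to the lower interval; the two conventions are kept separate here on purpose.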


E. Confidence-Learning Prediction Algorithm 

Besides the radii of the neighborhoods $r_m$, the only parameter
to be chosen by the user in Algorithm 1 is the desired
confidence threshold $q_{th}$. Since for an instructor it is more
natural and practical to specify a desired prediction accuracy
or error directly rather than the confidence threshold, we show
in this section how to automatically learn the appropriate
confidence threshold to achieve a certain prediction performance
and what consequences this has on the average prediction time.
We will discuss a possible way of choosing the radii of the
neighborhoods in Section IV-B.

Formally we define the problem as follows. Let $p(k, q_{th}) \in [0,1]$
denote the proportion of current year students for
which the grade prediction algorithm working with confidence
threshold $q_{th} \leq 1$ has predicted the overall score by time
(performance assessment) $k \in [0, K]$. $P_{min}$ is the minimum
percentage of current year students whose grade the user wants
to predict with a specified accuracy. $E(k, q_{th}) \geq 0$ denotes the
average absolute prediction error up to time $k$ for the given
confidence. $E_{max}$ is the maximum error the user is willing to
tolerate. $k(p, q_{th})$ is the time necessary to predict the grade
for proportion $p$ of all students of the class using confidence
threshold $q_{th}$. Please note that since the variables $p$, $E$, $q_{th}$ and
$k$ are dependent, we can only independently specify two of the
four variables. If we for example specify to predict all ($p = 1$)
students with zero error ($E = 0$), the algorithm will have to
wait until the end of the course when the overall score is
known ($k = K$) and will use maximum confidence ($q_{th} = 1$).
Without making any assumptions on the dependence of the
variables on each other, multiple pairs $(k, q_{th})$ might lead to
the same specified pair $(p, E)$.


Algorithm 2 Confidence-Learning Prediction Algorithm
Input: $E_{max}$, $P_{min}$, $q_{th,0}$, X and z from past years, number
       $M$ and radii $r_1, \dots, r_M$ of neighborhoods
Output: Predictions $\hat{z}$, $k_y$ and $q_{th,y}$ for all years

1: for all years y do
2:   if y > 1 then
3:     Find $k_{y-1}$ and $q_{th,y-1}$ according to (15) by running
       Algorithm 1 with various $q_{th}$
4:     Return $k_{y-1}$ and $q_{th,y-1}$
5:   end if
6:   Use Algorithm 1 with $q_{th} = q_{th,y-1}$ to predict and
     return the grades of current year y students
7: end for


Our goal is, therefore, to learn from past data the $y$th
year estimate of the minimal time $k_y$ and corresponding
confidence threshold $q_{th,y}$ necessary to achieve the desired
share $P_{min}$ of students predicted and the desired maximum
average prediction error $E_{max}$. This is formally defined as:

$$\begin{aligned} \underset{q_{th}}{\text{minimize}} \quad & k(p, q_{th}) \\ \text{subject to} \quad & p(k, q_{th}) \geq P_{min} \\ & E(k, q_{th}) \leq E_{max}. \end{aligned} \qquad (15)$$

Note that while the goal of the optimization problem underlying
Algorithm 1 is to find the minimum time to predict the overall
score of a particular student with a desired confidence,
problem (15) aims at finding the minimum time by which the
overall scores of a specific percentage of all students can be
predicted with a desired maximum error.

At a given year $y$ we solve this optimization problem
with a brute-force approach using all available data
from years $1, \dots, y$. For this purpose, we extract $k$, $p$ and $E$
from Algorithm 1 for a large number of different confidence
thresholds $q_{th}$. We then select the confidence threshold $q_{th,y}$
which is optimal with respect to optimization problem (15)
and determine the corresponding prediction time $k_y$. To make
the grade predictions for year $y+1$ we use the learned
confidence threshold $q_{th,y}$ as input to prediction Algorithm 1.
Since there are no training data available yet at year $y = 1$,
the algorithm uses a user-defined starting value $q_{th,0}$ for the
grade predictions of the first year. Algorithm 2 summarizes
the learning algorithm in pseudocode.
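The brute-force search over confidence thresholds can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `simulate` is a hypothetical callable that runs the prediction algorithm on past data for one threshold and returns, for every student, a pair (prediction time, absolute prediction error); `candidates` is a user-chosen grid of thresholds.

```python
def earliest_feasible_time(outcomes, n_times, p_min, e_max):
    """Smallest time k at which at least a share p_min of the students has
    been predicted with average absolute error at most e_max, or None.

    outcomes: list of (prediction_time, abs_error), one pair per student.
    """
    n_students = len(outcomes)
    for k in range(1, n_times + 1):
        errors = [err for t, err in outcomes if t <= k]
        if not errors:
            continue  # nobody predicted yet at time k
        share = len(errors) / n_students
        avg_error = sum(errors) / len(errors)
        if share >= p_min and avg_error <= e_max:
            return k
    return None


def learn_threshold(simulate, candidates, n_times, p_min, e_max):
    """Brute-force search: try every candidate threshold and keep the one
    with the smallest feasible prediction time. Returns (k, q_th) or None."""
    best = None
    for q_th in candidates:
        k = earliest_feasible_time(simulate(q_th), n_times, p_min, e_max)
        if k is not None and (best is None or k < best[0]):
            best = (k, q_th)
    return best
```

Ties between thresholds with the same prediction time are resolved in favor of the first candidate tried; the text does not specify a tie-breaking rule, so this choice is an assumption.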

IV. Experiments 

In this section, we present the data, discuss details of
the application of Algorithm 1 to our dataset, illustrate the
functioning of the algorithm and evaluate its performance
by comparing it against other prediction methods in both
regression and binary classification settings. Due to space
limitations, we will not show experimental results for classification
settings with more than two categories.

A. Data Analysis 

Our experiments are based on a dataset from an undergraduate
digital signal processing course (EE113) taught at
UCLA over the past 7 years. The dataset contains the scores
from all performance assessments of all students and their
final letter grades. The number of students enrolled in the
course for a given year varied between 30 and 156; in total
the dataset contains the scores of approximately 700 students.
Each year the course consists of 7 homework assignments, one
in-class midterm exam taking place after the third homework
assignment, one course project that has to be handed in after
homework 7 and the final exam. The duration of the course
is 10 weeks and in each week one performance assessment
takes place. The weights of the performance assessments are
given by: 20% homework assignments with equal weight on
each assignment, 25% midterm exam, 15% course project
and 40% final exam. Fig. 3a shows the distribution of the
letter grades assigned over the 7 years. We observe that on
average B is the grade the instructor assigned most frequently.
A was assigned second most and C third most frequently.
Surprisingly, however, the distribution varies drastically over
the years; in year 1 for example only 18.75% received a B
while in year 6 the frequency was 38.9%.

To understand the predictive power of the scores in different
performance assessments, Fig. 3b shows the sample Pearson
correlation coefficient between all performance assessments
and the overall score. We make several important observations
from this graph. First, on average the final exam has
the strongest correlation to the overall score, followed by
the midterm exam. This is not surprising, since the final
contributes 40% and the midterm contributes 25% to the
overall score. Second, the score from the course project on
average does not have a higher correlation with the overall
score than the homework assignments despite the fact that it
accounts for 15% of the overall score. Third, all homework
assignments have similar correlation coefficients. Fourth, the
correlation between the individual performance assessments
and the overall score varies greatly over the years. This
indicates that predicting student scores based on training data
from past years might be difficult.

Since all performance assessments are part of the overall
score and, therefore, a high correlation is expected, it is also
informative to consider the correlation between the performance
assessments and the final exam shown in Fig. 3c. It
is interesting to observe that the midterm exam still shows,
besides the overall score, the highest correlation with the final
exam. A possible explanation for this is that both the midterm
and final are in-class exams while the other performance
assessments are take-home.
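For readers who want to reproduce this kind of analysis on their own course data, the sample Pearson correlation coefficient between the scores in one performance assessment and the overall score can be computed as sketched below. The function name `pearson` and the list-based input format are assumptions made for this illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Calling `pearson(midterm_scores, overall_scores)` once per year and per assessment would yield the kind of values plotted in Fig. 3b; both list names are placeholders.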


B. Our Algorithm 

In this section we discuss four important details of the
application of Algorithm 1 to the dataset from the undergraduate
digital signal processing course.

First, the rule we use to normalize all scores $a_{i,y,k}$ in our
dataset is given by

$$a_{i,y,k} = \frac{\tilde{a}_{i,y,k} - \hat{\mu}_{y,k}}{\hat{\sigma}_{y}} \qquad (16)$$

where $\tilde{a}_{i,y,k}$ is the original score of the student, $\hat{\mu}_{y,k}$ is
the sample mean of all $y$th year students' original scores
in performance assessment $k$ and $\hat{\sigma}_{y}$ is the standard
deviation of all $y$th year students' original overall scores. A
normalization of the scores is needed for several reasons.
First, the instructor-defined maximum score in a particular
performance assessment may differ greatly across years and
since we use data from past years to predict the performance
of students in a given year, we need to make the data across
years comparable. Second, the difficulty of individual
performance assessments might also differ across years;
homework 2 might for instance be very easy in year 2 so
that almost everyone achieves the maximum score and very
difficult in year 3 so that few achieve half of the maximum
score. The normalization according to (16) eliminates this bias
by transforming the absolute scores of a student to scores
relative to his/her classmates of the same year. Note that
Algorithm 1 does not require a specific normalization and it
does not matter that the normalized scores according to (16)
will not be in the interval [0,1] as assumed in Section III for
simplicity.

Footnote: As we explain in a separate footnote and show in Section IV-D2,
our algorithm can also be applied to settings where the number and
weights of performance assessments change over the years.
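The normalization step can be sketched as below, following the verbal description that accompanies (16): each score is centered with the per-assessment mean of its year and scaled by the standard deviation of that year's overall scores. The function name `normalize_year`, the nested-list layout and the use of the population standard deviation are assumptions of this sketch; the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def normalize_year(raw, overall):
    """Normalize one year of scores.

    raw[i][k]  : original score of student i in assessment k
    overall[i] : original overall score of student i (same year)
    Returns a list of rows with the normalized scores.
    """
    sigma = pstdev(overall)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("overall scores are constant; cannot normalize")
    n_assessments = len(raw[0])
    # Per-assessment sample mean over all students of the year.
    means = [mean(row[k] for row in raw) for k in range(n_assessments)]
    return [[(row[k] - means[k]) / sigma for k in range(n_assessments)]
            for row in raw]
```

Because every year is normalized separately, a past year with a different maximum score or a harder assignment ends up on the same relative scale as the current year.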

Second, we use feature vectors that simply contain the
scores of all performance assessments student $i$ has taken up to
time $k$ in the order they occurred: $\mathbf{x}_{i,y,k} = (a_{i,y,1}, \dots, a_{i,y,k})$.

To incorporate the fact that students who have performed
similarly in a performance assessment with a lot of weight should
be nearer to each other in the feature space than students
that have had similar scores in a performance assessment (e.g.
homework assignment) with low weight, we use a weighted
metric to calculate the distance between two feature vectors.
We define the distance of two feature vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ as

$$\langle \mathbf{x}_i, \mathbf{x}_j \rangle_k = \frac{\sum_{l=1}^{k} w_l \, |x_{i,l} - x_{j,l}|}{\sum_{l=1}^{k} w_l} \qquad (17)$$

where $k$ is the length of the feature vectors, $w_l$ is the weight
of performance assessment $l$ and $x_{i,l}$ denotes entry $l$ of feature
vector $\mathbf{x}_i$.
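The weighted distance in (17) is a weight-normalized sum of absolute differences and can be written directly. The function name `weighted_distance` is hypothetical; `weights` is assumed to hold the course weights of the assessments in the order in which they occur.

```python
def weighted_distance(x_i, x_j, weights):
    """Weighted distance between two feature vectors of length k.

    Only the first k weights are used, where k is the number of
    assessments observed so far.
    """
    k = len(x_i)
    if len(x_j) != k or len(weights) < k:
        raise ValueError("vectors and weights have inconsistent lengths")
    used = weights[:k]
    weighted_sum = sum(w * abs(a - b) for w, a, b in zip(used, x_i, x_j))
    return weighted_sum / sum(used)
```

For example, with weights `[1, 3]` a difference of 2 in the second assessment contributes six times as much as a difference of 1 in the first.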

Third, rather than specifying the radii of the neighborhoods
to consider as an input, as suggested in the pseudocode
of Algorithm 1, we automatically adapt the radii of the
neighborhoods such that they contain a certain number of
neighbors. Since the sample variance gets more accurate with
an increasing number of samples, we refrain from considering
neighborhoods with only 2 neighbors. Therefore, the
smallest radius considered, $r_1$, is the minimal radius such
that the neighborhood includes 3 neighbors. For subsequent
neighborhoods the minimal radius is chosen such that the
neighborhood includes at least one neighbor more than the
previous neighborhood. Formally, we define the selection of
the radii recursively as

$$\begin{aligned} r_1 &= \min r, \quad \text{s.t. } |B(\mathbf{x}_{i,y,k}, r)| \geq 3 \\ r_{m+1} &= \min r, \quad \text{s.t. } |B(\mathbf{x}_{i,y,k}, r)| > |B(\mathbf{x}_{i,y,k}, r_m)|. \end{aligned} \qquad (18)$$
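The recursive radius selection in (18) amounts to walking through the sorted distances between the current student and all past students. The sketch below assumes closed neighborhoods (a past student at distance exactly r belongs to the neighborhood of radius r); the function name `neighborhood_radii` and the parameter `max_count` (the number M of neighborhoods) are hypothetical.

```python
def neighborhood_radii(distances, max_count, min_neighbors=3):
    """Radii r_1 < r_2 < ... such that the first neighborhood holds at
    least `min_neighbors` past students and every further neighborhood
    holds at least one more student than the previous one.

    distances: distances from the current student to all past students.
    """
    ordered = sorted(distances)
    if max_count < 1 or len(ordered) < min_neighbors:
        return []
    # r_1 is the distance to the min_neighbors-th closest past student.
    radii = [ordered[min_neighbors - 1]]
    for dist in ordered[min_neighbors:]:
        if len(radii) == max_count:
            break
        if dist > radii[-1]:  # strictly larger radius adds a neighbor
            radii.append(dist)
    return radii
```

Past students at identical distances enter a neighborhood together, so a neighborhood can grow by more than one student from one radius to the next.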


Fourth, to be able to apply our algorithm in settings in which
the structure of the course (e.g. the number, weight and sequence
of assessments) changes across years, we need to pre-process
the data from past years. In particular, the data from past years
is pre-processed so that the number and sequence of in-class
and take-home assessments is the same as in the current year.
In addition, to identify the most similar students, it is important
that the performance assessments that cover the same topic are
at the same place of the sequence/feature vector and, therefore,
are compared to each other. Consequently, we might have to
pre-process the data from past years even if the total number
of performance assessments is the same. We use two different
types of modifications to pre-process data from past years.

Fig. 3. Data analysis: (a) shows the distribution of letter grades for all years. (b) and (c) present the sample Pearson correlation coefficient between individual
performance assessments and the overall (b) or final exam (c) score. Note that we use the abbreviations Hi (homework assignment i), M (midterm exam),
F (final exam) and O (overall score) in the figures.

Modification 1 applies to cases where a topic of the course
was tested with a larger number of performance assessments
in a past year than in a current year. For example, consider
a signal processing course which contained two homework
assignments on the Fast Fourier Transform in year 1 but the
same topic was covered in only one homework assignment
in year 2. In this case, the two performance assessments
on the same topic from year 1 are combined to a single
assessment. If $N$ assessments are combined, the score of the
combined assessment $a_{comb}$ is calculated based on the weights
$w_1, \dots, w_N$ of the past assessments with scores $a_1, \dots, a_N$
according to

$$a_{comb} = \frac{\sum_{k=1}^{N} w_k a_k}{\sum_{k=1}^{N} w_k} \qquad (19)$$
Modification 2 applies to the case where a topic of the
course was tested with a lower number of performance
assessments in a past year than in a current year. In this case the
past-year performance assessment on this topic is duplicated.
Note that through duplication this performance assessment
gets more weight in the process of selecting similar students.
This is desired because the instructor probably uses more
performance assessments to test a certain topic because they
think that this topic is very important and hence it will be
informative in terms of predicting the grade.
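Both modifications are simple list operations on a past-year feature vector. The following sketch uses hypothetical helper names (`combine_assessments`, `duplicate_assessment`) and assumes that scores are already normalized.

```python
def combine_assessments(scores, weights):
    """Modification 1: merge N past assessments on the same topic into
    one score, weighting each by its past-year course weight (19)."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("need one weight per score")
    return sum(w * a for w, a in zip(weights, scores)) / sum(weights)


def duplicate_assessment(vector, index, copies):
    """Modification 2: repeat the past-year score at `index` so that it
    appears `copies` times in total, matching the current-year layout."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return vector[:index] + [vector[index]] * copies + vector[index + 1:]
```

For instance, two past homework scores 0.5 and 0.25 with weights 1 and 3 combine to 0.3125, and duplicating the middle entry of `[1, 2, 3]` gives `[1, 2, 2, 3]`.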

Finally, if necessary the sequence of performance assessments
from the past years is reordered to match the sequence
of performance assessments from the current year. The
reordering has to be done so that the performance assessments on
the same topic are at the same position of the sequence/feature
vector in both years. Additionally, in-class assessments should
only be compared to in-class assessments and take-home
assessments should only be compared to take-home assessments.
After this pre-processing of past-year data, the standard
Algorithm 1 can be applied to make the predictions. Note
that our algorithm always uses the weights of the current-year
course to find students similar to the student for whom
it needs to issue a personalized grade prediction and does not
consider the weights that were used in past-year courses for the
various assessments. The grade predictions for the current-year
course are usually made based on data from several past-year
courses. The data for each of the past years might have to be
pre-processed separately.


C. Benchmarks 

We compare the performance of our algorithm against five
different prediction methods.

• We use the score $a_{i,y,k}$ student $i$ has achieved in the
most recent performance assessment $k$ alone to predict
the overall grade.

• A second simple benchmark makes the prediction based
on the scores $a_{i,y,1}, \dots, a_{i,y,k}$ student $i$ has achieved
up to performance assessment $k$, taking into account the
corresponding weights of the performance assessments.

• The $k$-Nearest Neighbors algorithm with 7 neighbors.
This number provided the best results with training data
from the first year.

• Linear regression using ordinary least squares (OLS)
finds the least squares optimal linear mapping between
the scores of the first $k$ performance assessments and the
overall score.

• In classification settings we use logistic regression instead
of linear regression.

• Support vector machines (SVMs) are used in the
classification setting.
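As an illustration of the nearest-neighbor benchmark, a plain k-Nearest Neighbors regressor with 7 neighbors could look as follows. The text does not state which distance the benchmark used or how neighbor scores were aggregated, so the Euclidean default and the unweighted mean are assumptions of this sketch, as is the function name `knn_predict`.

```python
from math import dist as euclidean


def knn_predict(x, past_features, past_overall, k=7, distance=euclidean):
    """Predict the overall score of a student with feature vector `x` as
    the mean overall score of the k closest past students."""
    if len(past_features) != len(past_overall) or not past_features:
        raise ValueError("need one overall score per past feature vector")
    ranked = sorted(zip(past_features, past_overall),
                    key=lambda pair: distance(x, pair[0]))
    nearest = ranked[:k]  # fewer than k if the knowledge base is small
    return sum(z for _, z in nearest) / len(nearest)
```

Unlike the proposed algorithm, this benchmark uses a fixed number of neighbors and has no rule for deciding when to issue the prediction.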

The advantage of the method we use in our algorithm over
linear and logistic regression is that, being a nearest neighbor
method, it is able to recognize certain patterns such as trends
in the data that are missed in linear/logistic regression, where
a single parameter per performance assessment has to fit all
students. In contrast, our algorithm is able to detect such
patterns if there have been students in the past who have shown
similar patterns.

TABLE II
Case Study: Illustrative Example

            HW 1    HW 2    HW 3    Midterm   Do Well/Poorly
Student 1    0.53    0.00   -0.37    -1.35    Poorly
Student 2    1.07    0.87   -0.30    -1.06    Poorly
Student 3   -1.39   -1.54   -2.15     0.50    Well

Table II illustrates this through a case study extracted
from the UCLA undergraduate digital signal processing course
data. We present cases from a simulation where we predicted
whether students are going to do well (letter grade $\geq$ B-)
or do poorly (letter grade $\leq$ C+) and consider the students
for which our algorithm decided to predict after the midterm
exam. The table shows 3 students whom logistic regression
classified wrongly while our algorithm made the accurate
prediction. In columns 2-5 we present the scores the students
achieved up to the midterm exam and the last column shows
the true classification of the students. These cases are typical
examples of settings where our algorithm outperforms logistic
regression. Students 1 and 2 both showed a good performance
in homework assignment 1. However, in later assignments and
especially at the midterm exam their performance successively
deteriorated, an indication that the students might do poorly
in the class if they or the instructor and teaching assistants do
not take corrective actions. Our algorithm is likely to have
learned such patterns from past data and predicts the students
to do poorly. On average, however, their performance in the
first four performance assessments is still about average and,
therefore, logistic regression predicts that the students will do
well. For student 3 the situation is the other way around.


D. Results 

In this section we evaluate the performance of Algorithm 1
in different settings and compare it to benchmarks in both
regression and classification tasks.

As a performance measure in the regression setting, we use
the average of the absolute values of the prediction errors
$E$. Since we normalized the overall score to have zero mean
and a standard deviation of 1, $E$ directly corresponds to the
number of standard deviations the predictions on average are
away from the true values. The overall performance measure
in classification settings is the accuracy of the classification.
Furthermore, we use the quantities precision, recall and false
positive/negative rate besides accuracy to measure performance.
Please note that positive in our case means that the
student does poorly.
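The performance measures named above can be computed as in the following sketch, with label 1 (does poorly) treated as the positive class. The function names and the dictionary layout of the result are choices made for this illustration; ratios with a zero denominator are returned as `None`.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute prediction error E for the regression setting."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else None


def classification_metrics(predicted, actual):
    """Accuracy, precision, recall and false positive/negative rates for
    binary labels, where 1 = does poorly is the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "false_positive_rate": _ratio(fp, fp + tn),
        "false_negative_rate": _ratio(fn, fn + tp),
    }
```

With this convention, recall is the share of students who actually do poorly that the algorithm flags, which is the quantity an instructor planning interventions is likely to care about most.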

1) Performance Comparison with Benchmarks in Regression
Setting: Having discussed the various performance measures,
we first address the regression setting. Fig. 4 visualizes
the performance of the algorithm we presented in Section
III-C and of benchmark methods. We generated Fig. 4 by
predicting the overall scores of all students from years 2-7. To
make the prediction for year $y$, we used the entire data from
years 1 to $y-1$ to learn from. Unlike our algorithm, the
benchmark methods do not provide conditions to decide after
which performance assessment the decision should be made.
Therefore, for benchmark methods we specified the prediction
time (performance assessment) $k$ for an entire simulation and
repeated the experiment for all $k = 1, \dots, 10$; the results are
plotted in Fig. 4. To generate the curve of our algorithm
we ran simulations using different confidence thresholds $q_{th}$
and for each threshold we determined $E$ and the performance
assessment (time) $k$ after which the prediction was made on
average.

Fig. 4. Performance comparison of different prediction methods. (x-axis: average prediction time, in performance assessments.)

Irrespective of the prediction method, Fig. 4 shows the
trade-off between timeliness and accuracy; the later we predict,
the more accurate our prediction gets. From the curve
for the prediction using a single performance assessment
we infer that there is a low correlation between homework
assignments/course project and the overall score and a high
correlation between the in-class assessments (midterm and
final exam) and the overall score. This observation is
congruent with the correlation analysis from Section IV-A. If
the prediction is made early, before the midterm, all methods
(except the prediction using a single performance assessment)
lead to similar prediction errors. We observe that while the
error decreases approximately linearly for our algorithm, the
performance of benchmark methods steeply increases after the
midterm and the final but stays approximately constant during
the rest of the time. The reason for this is that we obtained
the points of the curve for our algorithm by averaging the
prediction time of all students. Therefore, the point of the
curve above the midterm was not generated by predicting
after the midterm for all students; some predictions were made
earlier, some later. If on average the prediction is made after
homework 4, our algorithm shows a significantly smaller error
$E$ than benchmark methods, outperforming linear regression by
up to 65%.

2) Learning across Years and Instructors in Regression
Setting: Consider Fig. 5, which demonstrates the performance
increase of our algorithm when more data to learn from become
available. To generate the figure, we used our algorithm to
predict the overall scores of all 7th year students for different
confidence thresholds. The curves in dashed lines stem from
simulations using only one of the years 1-5 as training data
and the solid magenta curve uses all years 1-5 to learn from.
We observe that the prediction performance strongly depends
on the training data and differs if different years are used.
Most importantly, the performance is highest irrespective of
the average prediction time if the combination of the data from
all 5 years is used. This shows that our algorithm is able to
learn and improves its predictions over time.

Fig. 5. Illustration of learning from past data: Error of grade predictions for
year 7 depending on training data.

Fig. 6. Illustration of learning across instructors: Error of grade predictions
for year 7 depending on training data. (x-axis: average prediction time, in performance assessments.)

The undergraduate digital signal processing course is taught
twice a year by three different instructors at UCLA. While we
used only the data from one instructor in the previous plots,
Fig. 6 investigates the situation when we predict the grades
for a class of instructor 1 based exclusively on past data from
a different instructor 2. In practice this happens when a new
instructor takes over a course previously taught by someone
else. It is interesting to see whether our grade prediction still
works well in this setting. A good performance is not
self-evident for several reasons. Different instructors might set a
different focus concerning the knowledge imparted, they might
use a different textbook and they might prefer different styles
of homework assignments and in-class exams. Furthermore,
the structure of the course, e.g. the number and sequence of
homework assignments, the time when the midterm exam takes
place, the weights of performance assessments and whether a
course project and quizzes exist, might change drastically. To
generate Fig. 6 we predicted the overall score for the year 7
class of instructor 1 based on two different sets of previous
data. The solid blue curve was generated by using the data
from the classes in years 1-5 from the same instructor 1 as
training data. To obtain the dashed red curve, we used the data
from classes in years 1-5 from instructor 2 to learn from. While
the predictions using training data from the same instructor
are slightly more accurate, the performance with training data
from a different instructor is still very satisfying, showing a
good robustness of our algorithm with respect to different
instructors. For the subsequent results we again exclusively
use data from one instructor.

Fig. 7. Comparison of prediction time and accuracy between the UCLA
course EE113, which contains a midterm exam, and the UCLA course EE103,
which contains four in-class quizzes instead of a midterm exam. Note that
the tick labels Qi/Hi above the plot stand for quiz/homework i and that for
EE103 there are weeks in which both a homework and a quiz take place.


3) Performance Comparison with Course Containing Early
Quizzes in Regression Setting: The results in both the data
analysis section (Fig. 3b) and Section IV-D1 (Fig. 4) indicate
that scores in in-class exams are much better predictors of the
overall score than homework assignments. To verify this, we
consider two consecutive years of the UCLA course EE103,
which contains four in-class quizzes in course weeks 2, 4, 6
and 8 instead of a midterm. Fig. 7 visualizes that, starting
from the first quiz in week 2, indeed our algorithm is able
to predict the same percentage of the students with an up to
22% smaller cumulative average prediction error by a certain
week. We generated Fig. 7 by using Algorithm 1 to predict for
both courses the overall scores of the students in a particular
year based on data from the previous year. Note that for the
course with quizzes, the increase in the share of students
predicted is larger in weeks that contain quizzes than in weeks
without quizzes. This supports the thesis that quizzes are good
predictors as well.

According to this result, it is desirable to design courses
with early in-class exams. This enables a timely and accurate
grade prediction based on which the instructor can intervene
if necessary.

Fig. 8. Performance comparison between our algorithm and logistic regression
using accuracy, precision and recall for binary do well/poorly classification.

Fig. 9. Cumulative prediction time, accuracy, false positive and false negative
error rates for a binary do well/poorly classification with fixed

4) Performance Comparison with Benchmarks in Classification
Setting: The performances in the binary classification
settings are summarized in Fig. 8. Since logistic regression
turns out to be the most challenging benchmark in terms of
accuracy in the classification setting, we do not show the
performance of the other benchmark algorithms for the sake
of clarity. The goal was to predict whether a student is going
to do well, still defined as letter grades equal to or above
B-, or do poorly, defined as letter grades equal to or below
C+. Again, to generate the curves for the benchmark method,
logistic regression, we specified manually when to predict.
For our algorithm we again averaged the prediction times of
an entire simulation and varied $q_{th}$ to obtain different points
of the curve. Up to homework 4, the performance of the two
algorithms is very similar, both showing a high prediction
accuracy even with few performance assessments. Starting from
homework 4, our algorithm performs significantly better, with
an especially drastic improvement of recall. It is interesting
to see that even with full information, the algorithms do not
achieve a 100% prediction accuracy. The reason for this is
that the instructor did not use a strict mapping between overall
score and letter grade and the range of overall scores that lead
to a particular letter grade changed slightly over the years.

5) Decision Time and Accuracy in Classification Setting:
To better understand when our algorithm makes decisions and
with what accuracy, consider Fig. 9. We again investigate
binary do well/poorly classifications as discussed above. The
red curve (square markers) shows for what share of the total
number of students the algorithm makes the prediction by a
specific point in time. The remaining curves show different
measures of cumulative performance. We can for example see
that by the midterm exam we classify 85% of the students
with an accuracy of 76%. These timely predictions are
desirable since the earlier the prediction is made the more time
an instructor has to take corrective action. The cumulative
accuracy stays almost constant around 80% irrespective of
the prediction time. We believe that the reason for this is
that thanks to the confidence threshold, the easy decisions are
made early and harder decisions are made later. Consequently,
the expected accuracy of all predictions remains more or less
constant irrespective of the prediction time.
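Cumulative curves of the kind discussed here can be derived from per-student outcomes as sketched below. The input format, a list of (prediction time, correct?) pairs, and the function name `cumulative_curves` are assumptions of this illustration.

```python
def cumulative_curves(outcomes, n_times):
    """For every time k = 1..n_times, return (k, share, accuracy) where
    `share` is the fraction of students predicted by time k and
    `accuracy` is the accuracy over exactly those predictions.

    outcomes: list of (prediction_time, is_correct), one pair per student.
    """
    n_students = len(outcomes)
    curves = []
    for k in range(1, n_times + 1):
        decided = [correct for t, correct in outcomes if t <= k]
        share = len(decided) / n_students
        # Accuracy is undefined (None) until the first prediction is made.
        accuracy = sum(decided) / len(decided) if decided else None
        curves.append((k, share, accuracy))
    return curves
```

If easy cases are decided early and hard ones late, the accuracy column stays roughly flat while the share column rises, which is the behavior described for the cumulative accuracy above.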

V. Conclusion 

In this paper we develop an algorithm that allows for
a timely and personalized prediction of the final grades of
students exclusively based on their scores in early performance
assessments such as homework assignments, quizzes
or midterm exams. Using data from an undergraduate digital
signal processing course taught at UCLA, we show that the
algorithm is able to learn from past data, that it outperforms
benchmark algorithms with regard to accuracy and timeliness
both in classification and regression settings and that the
predictions are robust even when the course is taught by
different instructors.

We show that in-class exams are better predictors of the
overall performance of a student than homework assignments.
Hence, designing courses to have early in-class evaluations
enables timely identification of students who, with a high
probability, would do poorly without intervention and enables
remedial actions to be adopted at an early stage.

Our algorithm can easily be generalized to include context 
data from students such as their prior GPA or demographic 
data. If applied exclusively to MOOCs, the in-course data 
used for the predictions could be extended for example by 
the responses of students to multiple-choice questions, their 
forum activity, the course material they studied or the time they 
spent studying online. Another direction of future work is to 
apply our algorithm in practice and investigate to what extent 
the performance of students can be improved by a timely 
intervention based on the grade predictions. In this context, 
our algorithm could be extended to make multiple predictions 
for each student to monitor the trend in the predicted grade 
after an intervention. 

One example of an intervention would be that the instructor 
provides additional study material to students with a low 
predicted grade. Alternatively, teaching assistants could spend 
additional time with these selected students to go through important
topics again. In a MOOC setting, the intervention could 
take place in a fully automated way, for example by presenting 
the students additional study material in a personalized way 
using techniques discussed in ||^. To make students aware 
of their performance, they could be asked to predict their own 
overall grade and as a comparison the instructor could disclose 
the prediction of our algorithm to the students. 

Appendix 

In this Appendix, we prove the theorem from Section III-C. 
Before we start with the proof, we discuss some preliminary 
results. 

Fact 1. (Chernoff-Hoeffding Bound) Let $X_1, X_2, \ldots, X_n$ be 
independent and bounded random variables with range $[0,1]$ 
and expected value $\mu$. Let $\hat{\mu}_n = (X_1 + \ldots + X_n)/n$ denote 
the sample mean of the random variables. Then, for all $\epsilon > 0$, 

$$P\left[\,|\hat{\mu}_n - \mu| > \epsilon\,\right] \le 2 \exp\left[-2 n \epsilon^2\right].$$

Proof: A proof of Fact 1 can be found in Hoeffding's 
paper [32]. ■ 

Fact 2. (Empirical Bernstein Bound) Let $n \ge 2$ and 
$X_1, X_2, \ldots, X_n$ be independent and bounded random 
variables with range $[0,1]$ and variance $\mathrm{Var}$. $\hat{\mu}_n$ denotes the 
n-sample mean $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $\mathrm{Var}_n$ denotes the n-sample 
variance $\mathrm{Var}_n = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat{\mu}_n)^2$. Then, the 
following inequality, which bounds the probability that the error of 
the sample standard deviation (the square root of the 
sample variance) is larger than a given value $\epsilon > 0$, can be derived: 

$$P\left[\,\left|\sqrt{\mathrm{Var}_n} - \sqrt{\mathrm{Var}}\right| > \epsilon\,\right] \le 2 \exp\left[-\frac{(n-1)\,\epsilon^2}{2}\right].$$

Proof: See [?] for a proof of Fact 2. ■ 

Lemma 1. Let $\hat{m}_k(x)$ denote the index of the neighborhood 
selected by our algorithm for the student with feature vector 
$x$ at time $k$, and let $m_k^*(x)$ denote the index of the optimal 
neighborhood. $M$ denotes the total number of neighborhoods our 
algorithm considers and $\Delta_k(x)$ is given by (12). We can bound 
the probability that our algorithm chooses the wrong neighborhood by 

$$P\left[\hat{m}_k(x) \neq m_k^*(x)\right] \le 2M \exp\left[-\frac{\Delta_k(x)^2}{8} \min_{1 \le m \le M} \left(|B(x, r_m)| - 1\right)\right].$$

Proof: Let $\widehat{\mathrm{Var}}(x, r_m)$ denote the sample variance of the residual 
scores in the neighborhood $B(x, r_m)$ and $\mathrm{Var}(x, r_m)$ the corresponding 
true variance. Consider: 

$$
\begin{aligned}
P\left[\hat{m}_k(x) \neq m_k^*(x)\right]
&= P\left[\arg\min_{1 \le m \le M} \widehat{\mathrm{Var}}(x, r_m) \neq m_k^*(x)\right] \\
&= P\left[\arg\min_{1 \le m \le M} \sqrt{\widehat{\mathrm{Var}}(x, r_m)} \neq m_k^*(x)\right].
\end{aligned}
\tag{20}
$$

If the estimation error of the standard deviation is smaller than 
$\Delta_k(x)/2$ for all neighborhoods, our algorithm chooses the optimal 
neighborhood $m_k^*(x)$. Therefore, we get 

$$
\begin{aligned}
P\left[\arg\min_{1 \le m \le M} \sqrt{\widehat{\mathrm{Var}}(x, r_m)} \neq m_k^*(x)\right]
&\le P\left[\bigcup_{1 \le m \le M} \left\{\left|\sqrt{\widehat{\mathrm{Var}}(x, r_m)} - \sqrt{\mathrm{Var}(x, r_m)}\right| \ge \frac{\Delta_k(x)}{2}\right\}\right] \\
&\overset{(a)}{\le} \sum_{m=1}^{M} P\left[\left|\sqrt{\widehat{\mathrm{Var}}(x, r_m)} - \sqrt{\mathrm{Var}(x, r_m)}\right| \ge \frac{\Delta_k(x)}{2}\right] \\
&\overset{(b)}{\le} \sum_{m=1}^{M} 2 \exp\left[-\frac{\left(|B(x, r_m)| - 1\right) \Delta_k(x)^2}{8}\right] \\
&\le 2M \exp\left[-\frac{\Delta_k(x)^2}{8} \min_{1 \le m \le M} \left(|B(x, r_m)| - 1\right)\right],
\end{aligned}
$$

where (a) is the union bound and (b) follows from Fact 2 with $\epsilon = \Delta_k(x)/2$. ■ 
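The selection rule analyzed in Lemma 1 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's implementation: the function names, the two-point residual distribution, and the three hypothetical neighborhoods are invented for illustration; every neighborhood is given the same number of samples n; and the gap is taken to be the difference between the two smallest true standard deviations, which is an assumed reading of the quantity defined in (12). The bound is evaluated in the form 2M exp(-(gap^2/8)(n-1)), which is what the empirical Bernstein bound yields with a deviation of gap/2.

```python
import math
import random
import statistics


def select_neighborhood(residuals_per_neighborhood):
    """Index of the neighborhood whose residual scores have the smallest sample variance."""
    variances = [statistics.variance(r) for r in residuals_per_neighborhood]
    return min(range(len(variances)), key=variances.__getitem__)


def draw_residuals(std, n, rng):
    """n two-point draws 0.5 +/- std: values stay in [0, 1], true standard deviation is std."""
    return [0.5 + std if rng.random() < 0.5 else 0.5 - std for _ in range(n)]


def wrong_selection_rate(true_stds, n, trials, rng):
    """Fraction of trials in which the selected index is not the true minimizer."""
    best = min(range(len(true_stds)), key=true_stds.__getitem__)
    wrong = 0
    for _ in range(trials):
        samples = [draw_residuals(std, n, rng) for std in true_stds]
        if select_neighborhood(samples) != best:
            wrong += 1
    return wrong / trials


def selection_bound(true_stds, n):
    """Assumed bound 2 M exp(-(gap^2 / 8) (n - 1)), capped at 1."""
    ordered = sorted(true_stds)
    gap = ordered[1] - ordered[0]  # assumed definition of the gap (see lead-in)
    return min(1.0, 2 * len(true_stds) * math.exp(-(gap ** 2 / 8) * (n - 1)))


rng = random.Random(0)
true_stds = [0.1, 0.3, 0.45]  # hypothetical neighborhoods, each with n samples
rate = wrong_selection_rate(true_stds, n=1000, trials=100, rng=rng)
print(rate, selection_bound(true_stds, n=1000))
```

The bound is conservative: for small n it exceeds 1 and says nothing, while the empirical error rate in this toy setting is already very low. It becomes informative only once the neighborhoods contain enough past students relative to the squared gap.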
Proof of Theorem: Note that 




(a) 


iiVik ^ ^ I Ci,y,k H” ^ ^ '^l^i,y,,l 


= \c-i,y,k — ^i,y,k\ 


Z =1 


where (a) follows from equations 0 and 

There are three sources of error in our algorithm's prediction 
of an overall score: 

1) The wrong neighborhood size may be selected due to 
inaccurate approximations of the true residual score variances
of the neighborhoods through the sample variance. 

2) If the optimal neighborhood is selected, the sample mean 
of the residual scores in the neighborhood may not be a 
good approximation of their true mean. 

3) Even if the optimal neighborhood is selected and the 
sample mean equals the true mean, the residual score of 
the considered student may be different from the mean 
of the residual score distribution. 

In the following we separate these three error sources and 
derive a bound for each one. 

We have 
where (b) follows from (?), (c) is the law of total probability, 
and (d) and (e) both follow from the fact that $P[A, B] \le P[A]$. 
Lemma 1 provides a bound for the second term. Therefore, we 
focus on the first term 


P 
where (f) follows from the triangle inequality, the fact that 
$P[X + Y > x_0 + y_0] \le P[\{X > x_0\} \cup \{Y > y_0\}]$, and the 
union bound. The bound for the first term in step (g) follows 
from Chebyshev's inequality and the bound for the second 
term follows from the Chernoff-Hoeffding Bound from Fact 1. 
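As a hedged illustration of steps (f) and (g), suppose the optimal neighborhood has been selected. The symbols below are shorthand introduced only for this sketch and are not the paper's notation: $c$ is the student's residual score, $\hat{c}$ its prediction (the sample mean of $n$ independent residual scores in the neighborhood, each in $[0,1]$), and $\mu$ and $\sigma^2$ are the true mean and variance of the residual scores. Splitting the tolerance $\epsilon$ evenly between the two terms is also an assumption; the proof may use a different split.

```latex
\begin{align*}
P\big[\,|c - \hat{c}| > \epsilon\,\big]
  &\le P\big[\,|c - \mu| + |\mu - \hat{c}| > \epsilon\,\big]
     && \text{(triangle inequality)} \\
  &\le P\big[\,|c - \mu| > \tfrac{\epsilon}{2}\,\big]
     + P\big[\,|\hat{c} - \mu| > \tfrac{\epsilon}{2}\,\big]
     && \text{(union bound)} \\
  &\le \frac{4\sigma^2}{\epsilon^2}
     + 2\exp\!\left[-\frac{n\,\epsilon^2}{2}\right]
     && \text{(Chebyshev, Chernoff-Hoeffding)}
\end{align*}
```

The first term does not shrink with $n$: it reflects how far an individual student can deviate from the neighborhood mean. The second term decays exponentially as more past students fall into the neighborhood.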

Including the second term again and using Lemma 1, we get 
which concludes the proof. 
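The two concentration inequalities used in the proof can be checked numerically. The following is a minimal sketch, assuming uniformly distributed scores on [0, 1]; the function names, sample sizes, and thresholds are illustrative choices and do not come from the paper.

```python
import math
import random


def hoeffding_bound(n, eps):
    """Chernoff-Hoeffding: P[|sample mean - mean| > eps] <= 2 exp(-2 n eps^2)."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))


def chebyshev_bound(variance, eps):
    """Chebyshev: P[|X - mean| >= eps] <= variance / eps^2."""
    return min(1.0, variance / eps ** 2)


def tail_frequency(draw, center, eps, trials):
    """Empirical frequency of |draw() - center| > eps over independent trials."""
    return sum(abs(draw() - center) > eps for _ in range(trials)) / trials


rng = random.Random(1)
n, eps, trials = 50, 0.1, 5000

# Sample mean of n uniform [0, 1] scores (true mean 0.5) versus Hoeffding.
mean_tail = tail_frequency(
    lambda: sum(rng.random() for _ in range(n)) / n, 0.5, eps, trials
)

# A single uniform [0, 1] score (variance 1/12) versus Chebyshev, at a wider eps.
single_tail = tail_frequency(rng.random, 0.5, 0.4, trials)

print(mean_tail, hoeffding_bound(n, eps))
print(single_tail, chebyshev_bound(1 / 12, 0.4))
```

In both cases the empirical tail frequency should fall below the corresponding bound, typically by a wide margin, since neither inequality uses anything about the distribution beyond its range or variance.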